The introduction of low-cost RGB-D sensors has stimulated research in skeleton-based human action recognition. Devising a representation suitable for characterising actions from noisy skeleton sequences nevertheless remains a challenge. We offer two insights into this challenge. First, we show that the discriminative information in a skeleton sequence usually resides in a short temporal interval, and we propose a simple but effective local descriptor, called a trajectorylet, to capture the static and kinematic information within this interval. Second, we propose to encode each trajectorylet with a discriminative set of trajectorylet detectors, selected from a large pool of candidate detectors trained as exemplar-SVMs. The action-level representation is obtained by pooling the trajectorylet encodings. Evaluated on standard datasets acquired with the Kinect sensor, our method outperforms existing approaches under various experimental setups.
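To make the pipeline concrete, the following is a minimal sketch of one way it could be laid out, not the exact formulation used in this work. It rests on several assumptions: the static part of a trajectorylet is taken to be the joint positions in a short window, the kinematic part is taken to be frame-to-frame differences, the detectors are represented as linear scoring functions with hypothetical weights `W` and biases `b` (random placeholders here, standing in for exemplar-SVMs that would be trained and selected separately), and the pooling operator is assumed to be a maximum over time. The function names and the window length are illustrative only.

```python
import numpy as np


def trajectorylets(joints, length=5):
    """Build one local descriptor per sliding window of `length` frames.

    joints: array of shape (T, J, 3) with 3-D joint positions per frame.
    Returns an array of shape (T - length + 1, D).
    Assumption: static part = positions, kinematic part = finite differences.
    """
    T = joints.shape[0]
    flat = joints.reshape(T, -1)          # (T, 3J) positions
    velocity = np.diff(flat, axis=0)      # (T - 1, 3J) frame-to-frame motion
    descs = []
    for start in range(T - length + 1):
        static = flat[start:start + length].ravel()
        kinematic = velocity[start:start + length - 1].ravel()
        descs.append(np.concatenate([static, kinematic]))
    return np.asarray(descs)


def encode_and_pool(descs, W, b):
    """Score every trajectorylet with K linear detectors, then pool over time.

    W: (K, D) detector weights, b: (K,) biases (hypothetical placeholders).
    Assumption: max pooling; the text only states that encodings are pooled.
    Returns a (K,) action-level vector.
    """
    scores = descs @ W.T + b              # (N, K) detector responses
    return scores.max(axis=0)


# Illustrative usage on synthetic data: 20 frames, 15 joints, 8 detectors.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(20, 15, 3))
descs = trajectorylets(sequence, length=5)
W = rng.normal(size=(8, descs.shape[1]))
b = np.zeros(8)
action_vector = encode_and_pool(descs, W, b)
```

In this sketch the action-level vector has one entry per detector, so its dimensionality is governed by the size of the selected detector set rather than by the sequence length.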